This notebook demonstrates model workflows from the tidymodels R package, using {targets} as an explanatory tool.
Disclaimer: The fitting and modeling in this notebook don't represent best practices; they only serve to demonstrate workflows. In practice, you would tune each model using cross-validation on the training set, and you would define a separate recipe for each model type in the workflow_set() function.
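As a rough illustration of the tuning step mentioned in the disclaimer, the sketch below tunes a single hyperparameter of a random forest with 5-fold cross-validation on the training set. It is not part of this notebook's pipeline: the `ranger` engine, the two-predictor formula, the fold count, and the use of `modeldata::ames` as a stand-in for the training data are all assumptions made for the example.

```r
library(tidymodels)

# Assumption: the Ames data shipped with modeldata stands in for the notebook's data.
data(ames, package = "modeldata")

set.seed(123)
ames_split <- initial_split(ames, prop = 0.8)
ames_train <- training(ames_split)

# 5-fold cross-validation, built from the training set only
folds <- vfold_cv(ames_train, v = 5)

# Random forest with one tunable hyperparameter (the engine is an assumption)
rf_spec <- rand_forest(min_n = tune(), trees = 200) %>%
  set_engine("ranger") %>%
  set_mode("regression")

rf_wf <- workflow() %>%
  add_formula(Sale_Price ~ Gr_Liv_Area + Year_Built) %>%
  add_model(rf_spec)

# Evaluate 5 candidate values of min_n on every fold
tuned <- tune_grid(rf_wf, resamples = folds, grid = 5)

# Pick the candidate with the lowest cross-validated RMSE and refit on the training set
best <- select_best(tuned, metric = "rmse")
final_fit <- finalize_workflow(rf_wf, best) %>%
  fit(data = ames_train)
```

The same pattern would apply to each model type, with its own tunable parameters and, ideally, its own recipe.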
At a high level, a workflow object uses all or some of these elements:
We will be fitting the following models: lm_model, rf_model, xgb_model
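To make the idea concrete: a workflow bundles a preprocessor (such as a recipe) with a parsnip model specification. The sketch below shows how three specifications with the names used here might be declared and combined with one shared recipe. The engines, the predictors in the recipe, and the object names `ames_recipe_sketch`, `lm_workflow`, and `ames_workflows` are assumptions for illustration and may differ from what the pipeline actually does.

```r
library(tidymodels)
library(workflowsets)

# Model specifications (the engines are assumptions; the notebook does not state them)
lm_model  <- linear_reg() %>% set_engine("lm")
rf_model  <- rand_forest(mode = "regression") %>% set_engine("ranger")
xgb_model <- boost_tree(mode = "regression") %>% set_engine("xgboost")

# Assumption: modeldata::ames stands in for the cleaned training data
data(ames, package = "modeldata")

# A small hypothetical recipe: three predictors, with dummy variables for the factor
ames_recipe_sketch <- recipe(Sale_Price ~ Gr_Liv_Area + Year_Built + Bldg_Type, data = ames) %>%
  step_dummy(all_nominal_predictors())

# A single workflow: one preprocessor plus one model
lm_workflow <- workflow() %>%
  add_recipe(ames_recipe_sketch) %>%
  add_model(lm_model)

# A workflow set: every preprocessor crossed with every model
ames_workflows <- workflow_set(
  preproc = list(base = ames_recipe_sketch),
  models  = list(lm = lm_model, rf = rf_model, xgb = xgb_model)
)
```

Passing several named recipes in `preproc` is how one would give each model type its own preprocessing.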
Let’s take a look at the overall pipeline:
tar_glimpse()
## ── Attaching packages ────────────────────────────────────── tidymodels 0.1.2 ──
## ✓ broom     0.7.3      ✓ recipes   0.1.16
## ✓ dials     0.0.9      ✓ rsample   0.0.9
## ✓ dplyr     1.0.2      ✓ tibble    3.1.1
## ✓ ggplot2   3.3.2      ✓ tidyr     1.1.2
## ✓ infer     0.5.3      ✓ tune      0.1.5
## ✓ modeldata 0.1.0      ✓ workflows 0.2.2
## ✓ parsnip   0.1.5      ✓ yardstick 0.0.7
## ✓ purrr     0.3.4
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## x purrr::discard() masks scales::discard()
## x dplyr::filter()  masks stats::filter()
## x dplyr::lag()     masks stats::lag()
## x recipes::step()  masks stats::step()
These are the individual steps and how long each step takes:
tar_meta() %>%
  select(name, seconds) %>%
  kableExtra::kable()
| name | seconds |
|---|---|
| ames_raw | 0.218 |
| ames_cleaned | 0.022 |
| ames_split | 0.030 |
| ames_train | 0.003 |
| ames_recipe | 0.020 |
| workflow | 0.045 |
| fitted_models | 4.222 |
| report | 22.899 |
| make_workflow_sets | NA |
| make_ames_recipe | NA |
| fit_models | NA |
| lm_model | 0.001 |
| ames_metrics | 1.186 |
| xgb_model | 0.005 |
| ames_test | 0.002 |
| rf_model | 0.001 |
| models | 0.001 |
| model_names | 0.001 |
| predicted | 0.222 |
| pred_actual | 0.002 |
| eval | 0.027 |
tar_read(ames_raw) %>%
  skim()
| Name | Piped data |
|---|---|
| Number of rows | 2930 |
| Number of columns | 74 |
| _______________________ | |
| Column type frequency: | |
| character | 40 |
| numeric | 34 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| MS_SubClass | 0 | 1 | 11 | 41 | 0 | 16 | 0 |
| MS_Zoning | 0 | 1 | 5 | 28 | 0 | 7 | 0 |
| Street | 0 | 1 | 4 | 4 | 0 | 2 | 0 |
| Alley | 0 | 1 | 5 | 15 | 0 | 3 | 0 |
| Lot_Shape | 0 | 1 | 7 | 20 | 0 | 4 | 0 |
| Land_Contour | 0 | 1 | 3 | 3 | 0 | 4 | 0 |
| Utilities | 0 | 1 | 6 | 6 | 0 | 3 | 0 |
| Lot_Config | 0 | 1 | 3 | 7 | 0 | 5 | 0 |
| Land_Slope | 0 | 1 | 3 | 3 | 0 | 3 | 0 |
| Neighborhood | 0 | 1 | 6 | 39 | 0 | 28 | 0 |
| Condition_1 | 0 | 1 | 4 | 6 | 0 | 9 | 0 |
| Condition_2 | 0 | 1 | 4 | 6 | 0 | 8 | 0 |
| Bldg_Type | 0 | 1 | 5 | 8 | 0 | 5 | 0 |
| House_Style | 0 | 1 | 4 | 16 | 0 | 8 | 0 |
| Overall_Cond | 0 | 1 | 4 | 13 | 0 | 9 | 0 |
| Roof_Style | 0 | 1 | 3 | 7 | 0 | 6 | 0 |
| Roof_Matl | 0 | 1 | 4 | 7 | 0 | 8 | 0 |
| Exterior_1st | 0 | 1 | 5 | 7 | 0 | 16 | 0 |
| Exterior_2nd | 0 | 1 | 5 | 7 | 0 | 17 | 0 |
| Mas_Vnr_Type | 0 | 1 | 4 | 7 | 0 | 5 | 0 |
| Exter_Cond | 0 | 1 | 4 | 9 | 0 | 5 | 0 |
| Foundation | 0 | 1 | 4 | 6 | 0 | 6 | 0 |
| Bsmt_Cond | 0 | 1 | 4 | 11 | 0 | 6 | 0 |
| Bsmt_Exposure | 0 | 1 | 2 | 11 | 0 | 5 | 0 |
| BsmtFin_Type_1 | 0 | 1 | 3 | 11 | 0 | 7 | 0 |
| BsmtFin_Type_2 | 0 | 1 | 3 | 11 | 0 | 7 | 0 |
| Heating | 0 | 1 | 4 | 5 | 0 | 6 | 0 |
| Heating_QC | 0 | 1 | 4 | 9 | 0 | 5 | 0 |
| Central_Air | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
| Electrical | 0 | 1 | 3 | 7 | 0 | 6 | 0 |
| Functional | 0 | 1 | 3 | 4 | 0 | 8 | 0 |
| Garage_Type | 0 | 1 | 6 | 19 | 0 | 7 | 0 |
| Garage_Finish | 0 | 1 | 3 | 9 | 0 | 4 | 0 |
| Garage_Cond | 0 | 1 | 4 | 9 | 0 | 6 | 0 |
| Paved_Drive | 0 | 1 | 5 | 16 | 0 | 3 | 0 |
| Pool_QC | 0 | 1 | 4 | 9 | 0 | 5 | 0 |
| Fence | 0 | 1 | 8 | 17 | 0 | 5 | 0 |
| Misc_Feature | 0 | 1 | 4 | 4 | 0 | 6 | 0 |
| Sale_Type | 0 | 1 | 2 | 5 | 0 | 10 | 0 |
| Sale_Condition | 0 | 1 | 6 | 7 | 0 | 6 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Lot_Frontage | 0 | 1 | 57.65 | 33.50 | 0.00 | 43.00 | 63.00 | 78.00 | 313.00 | ▇▇▁▁▁ |
| Lot_Area | 0 | 1 | 10147.92 | 7880.02 | 1300.00 | 7440.25 | 9436.50 | 11555.25 | 215245.00 | ▇▁▁▁▁ |
| Year_Built | 0 | 1 | 1971.36 | 30.25 | 1872.00 | 1954.00 | 1973.00 | 2001.00 | 2010.00 | ▁▂▃▆▇ |
| Year_Remod_Add | 0 | 1 | 1984.27 | 20.86 | 1950.00 | 1965.00 | 1993.00 | 2004.00 | 2010.00 | ▅▂▂▃▇ |
| Mas_Vnr_Area | 0 | 1 | 101.10 | 178.63 | 0.00 | 0.00 | 0.00 | 162.75 | 1600.00 | ▇▁▁▁▁ |
| BsmtFin_SF_1 | 0 | 1 | 4.18 | 2.23 | 0.00 | 3.00 | 3.00 | 7.00 | 7.00 | ▃▂▇▁▇ |
| BsmtFin_SF_2 | 0 | 1 | 49.71 | 169.14 | 0.00 | 0.00 | 0.00 | 0.00 | 1526.00 | ▇▁▁▁▁ |
| Bsmt_Unf_SF | 0 | 1 | 559.07 | 439.54 | 0.00 | 219.00 | 465.50 | 801.75 | 2336.00 | ▇▅▂▁▁ |
| Total_Bsmt_SF | 0 | 1 | 1051.26 | 440.97 | 0.00 | 793.00 | 990.00 | 1301.50 | 6110.00 | ▇▃▁▁▁ |
| First_Flr_SF | 0 | 1 | 1159.56 | 391.89 | 334.00 | 876.25 | 1084.00 | 1384.00 | 5095.00 | ▇▃▁▁▁ |
| Second_Flr_SF | 0 | 1 | 335.46 | 428.40 | 0.00 | 0.00 | 0.00 | 703.75 | 2065.00 | ▇▃▂▁▁ |
| Gr_Liv_Area | 0 | 1 | 1499.69 | 505.51 | 334.00 | 1126.00 | 1442.00 | 1742.75 | 5642.00 | ▇▇▁▁▁ |
| Bsmt_Full_Bath | 0 | 1 | 0.43 | 0.52 | 0.00 | 0.00 | 0.00 | 1.00 | 3.00 | ▇▆▁▁▁ |
| Bsmt_Half_Bath | 0 | 1 | 0.06 | 0.25 | 0.00 | 0.00 | 0.00 | 0.00 | 2.00 | ▇▁▁▁▁ |
| Full_Bath | 0 | 1 | 1.57 | 0.55 | 0.00 | 1.00 | 2.00 | 2.00 | 4.00 | ▁▇▇▁▁ |
| Half_Bath | 0 | 1 | 0.38 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 2.00 | ▇▁▅▁▁ |
| Bedroom_AbvGr | 0 | 1 | 2.85 | 0.83 | 0.00 | 2.00 | 3.00 | 3.00 | 8.00 | ▁▇▂▁▁ |
| Kitchen_AbvGr | 0 | 1 | 1.04 | 0.21 | 0.00 | 1.00 | 1.00 | 1.00 | 3.00 | ▁▇▁▁▁ |
| TotRms_AbvGrd | 0 | 1 | 6.44 | 1.57 | 2.00 | 5.00 | 6.00 | 7.00 | 15.00 | ▁▇▂▁▁ |
| Fireplaces | 0 | 1 | 0.60 | 0.65 | 0.00 | 0.00 | 1.00 | 1.00 | 4.00 | ▇▇▁▁▁ |
| Garage_Cars | 0 | 1 | 1.77 | 0.76 | 0.00 | 1.00 | 2.00 | 2.00 | 5.00 | ▅▇▂▁▁ |
| Garage_Area | 0 | 1 | 472.66 | 215.19 | 0.00 | 320.00 | 480.00 | 576.00 | 1488.00 | ▃▇▃▁▁ |
| Wood_Deck_SF | 0 | 1 | 93.75 | 126.36 | 0.00 | 0.00 | 0.00 | 168.00 | 1424.00 | ▇▁▁▁▁ |
| Open_Porch_SF | 0 | 1 | 47.53 | 67.48 | 0.00 | 0.00 | 27.00 | 70.00 | 742.00 | ▇▁▁▁▁ |
| Enclosed_Porch | 0 | 1 | 23.01 | 64.14 | 0.00 | 0.00 | 0.00 | 0.00 | 1012.00 | ▇▁▁▁▁ |
| Three_season_porch | 0 | 1 | 2.59 | 25.14 | 0.00 | 0.00 | 0.00 | 0.00 | 508.00 | ▇▁▁▁▁ |
| Screen_Porch | 0 | 1 | 16.00 | 56.09 | 0.00 | 0.00 | 0.00 | 0.00 | 576.00 | ▇▁▁▁▁ |
| Pool_Area | 0 | 1 | 2.24 | 35.60 | 0.00 | 0.00 | 0.00 | 0.00 | 800.00 | ▇▁▁▁▁ |
| Misc_Val | 0 | 1 | 50.64 | 566.34 | 0.00 | 0.00 | 0.00 | 0.00 | 17000.00 | ▇▁▁▁▁ |
| Mo_Sold | 0 | 1 | 6.22 | 2.71 | 1.00 | 4.00 | 6.00 | 8.00 | 12.00 | ▅▆▇▃▃ |
| Year_Sold | 0 | 1 | 2007.79 | 1.32 | 2006.00 | 2007.00 | 2008.00 | 2009.00 | 2010.00 | ▇▇▇▇▃ |
| Sale_Price | 0 | 1 | 180796.06 | 79886.69 | 12789.00 | 129500.00 | 160000.00 | 213500.00 | 755000.00 | ▇▇▁▁▁ |
| Longitude | 0 | 1 | -93.64 | 0.03 | -93.69 | -93.66 | -93.64 | -93.62 | -93.58 | ▅▅▇▆▁ |
| Latitude | 0 | 1 | 42.03 | 0.02 | 41.99 | 42.02 | 42.03 | 42.05 | 42.06 | ▂▂▇▇▇ |
tar_read(ames_raw) %>%
  select_if(is.numeric) %>%
  correlate() %>%   # create a correlation data frame (cor_df)
  rearrange() %>%   # order variables so that highly correlated ones sit together
  shave() %>%       # blank out the upper triangle, keeping the lower one
  rplot()
##
## Correlation method: 'pearson'
## Missing treated using: 'pairwise.complete.obs'
## Don't know how to automatically pick scale for object of type noquote. Defaulting to continuous.
Before cleaning:
tar_read(ames_raw) %>%
  select(Sale_Price, Gr_Liv_Area, Year_Built, Bldg_Type, Latitude, Longitude) %>%
  ggpairs()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
After cleaning:
tar_read(ames_cleaned) %>%
  select(Sale_Price, Gr_Liv_Area, Year_Built, Bldg_Type, Latitude, Longitude) %>%
  ggpairs()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Let’s take a look at the predicted vs. observed values for our models.
tar_read(pred_actual) %>%
  pivot_longer(c(-Sale_Price)) %>%
  ggplot(aes(Sale_Price, value, col = name)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1) +
  scale_x_continuous(limits = c(4.5, NA)) +
  scale_y_continuous(limits = c(4.5, NA)) +
  facet_grid(name ~ .) +
  labs(title = "Predicted vs. Actual for Each Model", x = "Actual", y = "Predicted")
## Warning: Removed 3 rows containing missing values (geom_point).
Residuals vs. observed for each model:
tar_read(pred_actual) %>%
  pivot_longer(c(-Sale_Price)) %>%
  mutate(value = value - Sale_Price) %>%
  ggplot(aes(Sale_Price, value, col = name)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  facet_grid(name ~ .) +
  labs(title = "Actual vs. Residuals for Each Model", x = "Actual", y = "Residual")
tar_read(eval) %>%
  ggplot(aes(model, .estimate)) +
  geom_point() +
  facet_wrap(.metric ~ ., scales = "free") +
  coord_flip()
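The plot above reads a table with one row per model and metric. As a hedged sketch of how such a table could be produced, the example below computes yardstick's default regression metrics per model from a wide table of predictions. The toy values and the names `pred_actual_toy` and `eval_toy` are made up for illustration; the pipeline's own `eval` target may be built differently.

```r
library(dplyr)
library(tidyr)
library(yardstick)

# Toy stand-in for pred_actual: the observed outcome plus one prediction column per model
pred_actual_toy <- tibble::tibble(
  Sale_Price = c(5.10, 5.25, 5.40, 5.05, 5.30),
  lm_model   = c(5.12, 5.20, 5.35, 5.10, 5.28),
  rf_model   = c(5.08, 5.27, 5.42, 5.03, 5.33)
)

eval_toy <- pred_actual_toy %>%
  # Long format: one row per observation and model
  pivot_longer(-Sale_Price, names_to = "model", values_to = "predicted") %>%
  group_by(model) %>%
  # For a numeric truth column, metrics() returns rmse, rsq, and mae for each group
  metrics(truth = Sale_Price, estimate = predicted) %>%
  select(model, .metric, .estimate)
```

The result has the `model`, `.metric`, and `.estimate` columns that the plotting code expects.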
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] GGally_2.1.1 corrr_0.4.3 skimr_2.1.3 workflowsets_0.0.2
## [5] forcats_0.5.0 stringr_1.4.0 readr_1.4.0 tidyverse_1.3.0
## [9] yardstick_0.0.7 workflows_0.2.2 tune_0.1.5 tidyr_1.1.2
## [13] tibble_3.1.1 rsample_0.0.9 recipes_0.1.16 purrr_0.3.4
## [17] parsnip_0.1.5 modeldata_0.1.0 infer_0.5.3 ggplot2_3.3.2
## [21] dplyr_1.0.2 dials_0.0.9 scales_1.1.1 broom_0.7.3
## [25] tidymodels_0.1.2 tarchetypes_0.0.4 targets_0.1.0
##
## loaded via a namespace (and not attached):
## [1] colorspace_2.0-0 ellipsis_0.3.1 class_7.3-17 base64enc_0.1-3
## [5] fs_1.5.0 rstudioapi_0.13 farver_2.0.3 listenv_0.8.0
## [9] furrr_0.2.1 prodlim_2019.11.13 fansi_0.4.1 lubridate_1.7.9.2
## [13] xml2_1.3.2 codetools_0.2-18 splines_4.0.3 knitr_1.30
## [17] jsonlite_1.7.2 pROC_1.16.2 dbplyr_2.0.0 compiler_4.0.3
## [21] httr_1.4.2 backports_1.2.1 assertthat_0.2.1 Matrix_1.2-18
## [25] cli_2.2.0 visNetwork_2.0.9 htmltools_0.5.1.1 tools_4.0.3
## [29] igraph_1.2.6 gtable_0.3.0 glue_1.4.2 Rcpp_1.0.5
## [33] cellranger_1.1.0 DiceDesign_1.8-1 vctrs_0.3.6 iterators_1.0.13
## [37] timeDate_3043.102 gower_0.2.2 xfun_0.20 globals_0.14.0
## [41] ps_1.5.0 rvest_0.3.6 lifecycle_0.2.0 future_1.21.0
## [45] MASS_7.3-53 TSP_1.1-10 ipred_0.9-9 hms_0.5.3
## [49] parallel_4.0.3 RColorBrewer_1.1-2 yaml_2.2.1 rpart_4.1-15
## [53] reshape_0.8.8 stringi_1.5.3 highr_0.8 foreach_1.5.1
## [57] seriation_1.2-9 lhs_1.1.1 lava_1.6.8.1 repr_1.1.3
## [61] rlang_0.4.10 pkgconfig_2.0.3 evaluate_0.14 lattice_0.20-41
## [65] labeling_0.4.2 htmlwidgets_1.5.3 processx_3.4.5 tidyselect_1.1.0
## [69] parallelly_1.22.0 plyr_1.8.6 magrittr_2.0.1 R6_2.5.0
## [73] generics_0.1.0 DBI_1.1.0 pillar_1.6.0 haven_2.3.1
## [77] withr_2.3.0 survival_3.2-7 nnet_7.3-14 modelr_0.1.8
## [81] crayon_1.3.4 utf8_1.1.4 rmarkdown_2.6 grid_4.0.3
## [85] readxl_1.3.1 data.table_1.13.4 callr_3.5.1 webshot_0.5.2
## [89] reprex_0.3.0 digest_0.6.27 GPfit_1.0-8 munsell_0.5.0
## [93] registry_0.5-1 viridisLite_0.3.0 kableExtra_1.3.1